# imports
from google.colab import drive
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from IPython.display import Image
import matplotlib.pyplot as plt
import matplotlib
import numpy as np
import pandas as pd
import seaborn as sns
The goal of this challenge is to recognize very small organisms living in the oceans: plankton.
Provided with a set of images, metadata, and features extracted from the images by the Laboratoire d'Océanographie de Villefranche, we decided to treat this problem as a classification problem. A discussion of the classes and the taxonomy can be found in the Data Exploration part of this report.
To deal with this challenge, two main approaches were explored:
Our work is divided into three main parts:
drive.mount('/content/drive/',force_remount=True)
The taxonomy tree is huge. It aims to hierarchically identify everything that could possibly be a plankton. The first separation is living / not living. The living creatures are then separated into Bacteria / Eukaryota / other. Beyond this, our biological knowledge runs out.
As stated in the challenge, our target values come from the 'level2' column of the metadata.
taxo = pd.read_csv('/content/drive/My Drive/AML/Chal2 - Plankton/Data/taxo.csv')
meta = pd.read_csv('/content/drive/My Drive/AML/Chal2 - Plankton/Data/meta.csv')
features_native = pd.read_csv('/content/drive/My Drive/AML/Chal2 - Plankton/Data/features_native.csv')
features_skimage = pd.read_csv('/content/drive/My Drive/AML/Chal2 - Plankton/Data/features_skimage.csv')
X_features_10p_train = pd.read_csv('/content/drive/My Drive/AML/Chal2 - Plankton/Data/X_features_10p_train.csv')
X_features_10p_test = pd.read_csv('/content/drive/My Drive/AML/Chal2 - Plankton/Data/X_features_10p_test.csv')
y_features_10p_train = pd.read_csv('/content/drive/My Drive/AML/Chal2 - Plankton/Data/y_features_10p_train.csv')
y_features_10p_test = pd.read_csv('/content/drive/My Drive/AML/Chal2 - Plankton/Data/y_features_10p_test.csv')
clean_meta = meta.dropna(subset=['level2']) #deleting the rows with a NaN value in level2
meta_lvl2 = clean_meta[['objid','level2']]
fig = plt.figure(figsize=(10,10))
chart = plt.pie(x=meta_lvl2['level2'].value_counts())
plt.title('level2: Pie Chart Distribution', fontsize=20)
plt.legend(chart[0], labels=meta_lvl2['level2'].value_counts().index)
plt.show()
It is interesting to note that more than half the images are of detritus and that close to 75% of the images are either detritus or feces.
This pie chart shows that our dataset is highly imbalanced, which makes the choice of the macro-averaged f1 score questionable: the macro average weights every class equally, so it does not reflect the class imbalance, whereas the micro average would.
Still, the macro average tells us how well the system performs across all classes, including the rare ones.
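To make the difference concrete, here is a small illustration (with made-up labels, not our actual predictions) of how the macro and micro f1 averages react to imbalance:

```python
import numpy as np
from sklearn.metrics import f1_score

# toy imbalanced labels: 8 majority-class samples, 2 minority samples
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
# a lazy classifier that always predicts the majority class
y_pred = np.zeros(10, dtype=int)

micro = f1_score(y_true, y_pred, average="micro")  # dominated by the majority class
macro = f1_score(y_true, y_pred, average="macro")  # each class weighs the same

print(micro, macro)  # macro is much lower: it exposes the failure on class 1
```

The micro average rewards the lazy classifier (0.8), while the macro average penalises it heavily for completely missing the minority class.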
In addition to the images, the meta.csv file containing metadata about the images, and the taxo.csv file defining the taxonomy tree, we have two other csv files:
features_native.csv.gz
features_skimage.csv.gz
These two files contain information extracted from the images in the form of features, computed with ZooProcess, a software tool developed by engineers from the Laboratoire d'Océanographie de Villefranche-sur-Mer.
meta.isna().sum().sort_values(ascending=False)
There are 1003 NaN values in the column level2 and 3334 in the column level1. Since 'level2' is our target and this is a supervised learning problem, we cannot use the images corresponding to these NaN values. We therefore deleted the corresponding rows, using the objid identifier.
We also removed the rows corresponding to the unclassified images from the two features files.
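This filtering step can be sketched as follows; the join on objid drops every image whose level2 is NaN (objid and level2 are the real column names from our files, the other values are toy stand-ins):

```python
import numpy as np
import pandas as pd

# toy stand-ins for meta.csv and a features file
meta = pd.DataFrame({
    "objid": [1, 2, 3, 4],
    "level2": ["detritus", np.nan, "feces", np.nan],
})
features = pd.DataFrame({
    "objid": [1, 2, 3, 4],
    "area": [10.0, 20.0, 30.0, 40.0],
})

# keep only labelled images, then attach the label to the features via objid
meta_lvl2 = meta.dropna(subset=["level2"])[["objid", "level2"]]
labelled = features.merge(meta_lvl2, on="objid", how="inner")

print(labelled)  # rows with objid 2 and 4 (unlabelled) are gone
```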
train_na = (features_skimage.isnull().sum() / len(features_skimage)) * 100
train_na = train_na.drop(train_na[train_na == 0].index).sort_values(ascending=False)[:30]
missing_data = pd.DataFrame({'Table 1: Missing Ratio' :train_na})
missing_data.head(20)
features_native.isnull().sum().sort_values(ascending=False)[:20]
Image(filename='/content/drive/My Drive/AML/Chal2 - Plankton/Plot/Distrib_nb_rows.png', width=800, height=400)
Image(filename='/content/drive/My Drive/AML/Chal2 - Plankton/Plot/Distrib_nb_columns.png', width=800, height=400)
On the plots above, there are two things to notice:
The meta dataset can be useful, but let's first remove all the useless columns. It only contains information about the images and the projects (e.g., when the image was taken). We will only keep objid and the label, i.e. level2.
Here are the steps we took in order to create an exploitable dataset:
As we will describe in the Model Selection part, we had two approaches using images (a 'handmade' CNN and a pre-trained model from Keras). Here is how we processed the data for each of these approaches.
The distributions of the heights and widths of the images indicate that they are around 80 and 60 respectively (the medians are 87 rows and 67 columns).
To lower the computational cost, while still staying realistic and being able to use all the images at our disposal, we decided to:
Then we would build our CNN according to these sizes of inputs.
After resizing and creating all these images, we store them along with their labels in a list that we later turn into an array, which is easier on memory.
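A minimal sketch of the resizing step, assuming we bring every greyscale image to a fixed (80, 60) shape by centre-cropping large images and zero-padding small ones (the target shape and padding value here are illustrative choices, not imposed by the data):

```python
import numpy as np

def pad_or_crop(img, target=(80, 60), fill=0.0):
    """Centre-crop or zero-pad a 2-D image to the target shape."""
    out = np.full(target, fill, dtype=img.dtype)
    # overlapping region between source and destination
    h = min(img.shape[0], target[0])
    w = min(img.shape[1], target[1])
    src_top = (img.shape[0] - h) // 2
    src_left = (img.shape[1] - w) // 2
    dst_top = (target[0] - h) // 2
    dst_left = (target[1] - w) // 2
    out[dst_top:dst_top + h, dst_left:dst_left + w] = \
        img[src_top:src_top + h, src_left:src_left + w]
    return out

big = np.ones((100, 90))    # larger than target: cropped
small = np.ones((40, 30))   # smaller than target: padded
assert pad_or_crop(big).shape == (80, 60)
assert pad_or_crop(small).shape == (80, 60)
```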
For this pre-trained model, we had to deal with one major constraint: it takes as input images of size (224, 224, 3), i.e. (height, width, channels).
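A sketch of that preprocessing, assuming greyscale inputs: a nearest-neighbour resize to 224×224 (done here with plain numpy indexing, in place of a library resizer) followed by stacking the single channel three times:

```python
import numpy as np

def to_vgg_input(img, size=224):
    """Nearest-neighbour resize of a 2-D image, replicated on 3 channels."""
    rows = np.linspace(0, img.shape[0] - 1, size).round().astype(int)
    cols = np.linspace(0, img.shape[1] - 1, size).round().astype(int)
    resized = img[np.ix_(rows, cols)]        # (224, 224)
    return np.stack([resized] * 3, axis=-1)  # (224, 224, 3)

x = to_vgg_input(np.random.rand(87, 67))  # 87x67 is our median image size
assert x.shape == (224, 224, 3)
```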
N.B. When we tried to create an array that was too big, the environment would simply shut down and we would have to run everything again.
Image(filename='/content/drive/My Drive/AML/Chal2 - Plankton/Plot/Pie_chart.png', width=900, height=400)
The two pie charts above show the class distribution before and after data augmentation. Thanks to data augmentation, we gave more weight to the 'small' classes.
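The oversampling of small classes could be sketched as below, using shape-preserving flips and a 180° rotation; the exact transforms we used are not detailed in this report, so treat this choice as illustrative:

```python
import numpy as np

def augment(img):
    """Return shape-preserving variants of an image: flips and a 180-degree turn."""
    return [np.fliplr(img), np.flipud(img), np.rot90(img, 2)]

def oversample(images, labels, target_count):
    """Add augmented copies of a class until it reaches target_count images."""
    out_imgs, out_lbls = list(images), list(labels)
    i = 0
    while len(out_imgs) < target_count:
        for aug in augment(images[i % len(images)]):
            if len(out_imgs) >= target_count:
                break
            out_imgs.append(aug)
            out_lbls.append(labels[i % len(images)])
        i += 1
    return out_imgs, out_lbls

imgs = [np.random.rand(80, 60) for _ in range(3)]   # a tiny 'rare' class
more, lbls = oversample(imgs, ["copepoda"] * 3, 10)
assert len(more) == 10 and all(im.shape == (80, 60) for im in more)
```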
Image(filename='/content/drive/My Drive/AML/Chal2 - Plankton/Plot/Images.png', width=900, height=500)
The images above are examples from our dataset, before and after resizing.
To begin with, we decided to use a cross-validation strategy for this approach. Indeed, we noticed that the models tend to overfit, giving excellent results on the train set but poor results when tested.
After trying different models, we decided to stick with and try to optimize the k-Nearest-Neighbors algorithm, an easy-to-understand classifier. We also thought that a Random Forest Classifier would, in a way, be able to imitate the taxonomy tree.
However, as we will briefly describe, we obtained very poor results, probably because the features extracted from the images do not fully capture the distribution and organization of the data and its classes. That is why, as you will see in the next part, we focused more on Convolutional Neural Networks, which work directly on the images.
#Validation function
def f1_cv(model, X, y):
    # pass the KFold object itself as cv so the shuffle is actually used
    # (get_n_splits just returned the integer 5, discarding the shuffling)
    kf = KFold(n_splits=5, shuffle=True, random_state=42)
    f1 = cross_val_score(model, X, y, scoring="f1_macro", cv=kf)
    return f1
train_results = []
neighbors = list(range(1, 20))
for n in neighbors:
    knn = KNeighborsClassifier(n_neighbors=n)
    # no need to fit/predict on the train set here: cross_val_score
    # refits the model on each fold by itself
    train_results.append(np.mean(f1_cv(knn, X_features_10p_train,
                                       y_features_10p_train.values.ravel())))
    print(n)
fig = plt.figure(figsize=(10,6))
plt.title('Cross-validation macro f1-score wrt number of neighbors in kNN', fontsize=10)
plt.plot(neighbors, train_results)
plt.ylabel('F1 CV')
plt.xlabel('neighbors')
plt.show()
As we can see above, the kNN classifier gave middling results, and for small numbers of neighbors there is probably some overfitting taking place.
rfc = RandomForestClassifier(max_depth=50, min_samples_leaf=1, min_samples_split=2, n_estimators=1550)
f1_rfc = f1_cv(rfc, X_features_10p_train, y_features_10p_train)
pd.DataFrame([['Random Forest Classifier', np.mean(f1_rfc)]],columns=['Model','Cross-Validation f1 score'])
As for kNN, the Random Forest Classifier did not manage to give great results, even after some hyperparameter tuning with GridSearchCV. These results are only here to show that the features approach was not conclusive for us.
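The tuning we mention ran roughly like the sketch below, shown here on synthetic data; our actual parameter grid is not reproduced in this report, so the values are placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# synthetic stand-in for the features dataset
X, y = make_classification(n_samples=120, n_features=8, random_state=42)

# placeholder grid: the real search covered n_estimators, max_depth, etc.
param_grid = {"n_estimators": [10, 50], "max_depth": [5, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="f1_macro",
    cv=3,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```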
We believe that using directly the images with Convolutional Neural Networks is probably the way to go here.
results = pd.read_csv('/content/drive/My Drive/AML/Chal2 - Plankton/Data/final_results.csv').drop('Unnamed: 0', axis=1)
We trained both of our CNN models for 10 epochs. If you want more information about the training, feel free to have a look at the scratch notebook.
First, following what we saw in the data exploration, we wanted to build a CNN whose input is adjusted to the median sizes of the images in our dataset.
You can see below a description of the CNN's structure.
Image(filename='/content/drive/My Drive/AML/Chal2 - Plankton/Plot/CNNsoftmax_architechture.png', width=500, height=500)
Secondly, we realized that correctly training a CNN and making it fit our dataset takes a long time, even when the input shape matches the average image shape, and we were limited on time. So we decided to use transfer learning with a pre-trained model. A pre-trained model has already been trained for a long time on a very large dataset gathering all kinds of pictures, so its weights are already initialized, which saves time. The deeper you go in a CNN, the more complex the shapes it can identify. So, to specialize our pre-trained CNN on plankton, we only modify the last fully connected layers by training them on our pre-processed images.
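In Keras, this "replace the last fully connected layers" step can be sketched as below. The dense-layer size and the class count of 40 are illustrative assumptions (our exact head is shown in the architecture figure), and in practice we would load the pretrained weights with weights="imagenet":

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

def build_transfer_model(num_classes, weights="imagenet"):
    # pretrained convolutional base; include_top=False drops VGG16's own classifier
    base = VGG16(weights=weights, include_top=False, input_shape=(224, 224, 3))
    base.trainable = False  # freeze the pretrained convolutional weights
    model = models.Sequential([
        base,
        layers.Flatten(),
        layers.Dense(256, activation="relu"),            # illustrative head size
        layers.Dense(num_classes, activation="softmax"),  # one unit per level2 class
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# weights=None keeps this sketch offline; pass "imagenet" to load pretrained weights
model = build_transfer_model(num_classes=40, weights=None)
```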
You can see below a description of the pre-trained CNN's structure.
Image(filename='/content/drive/My Drive/AML/Chal2 - Plankton/Plot/VGG16_architechture.png', width=550, height=900)
results[['Model','Training execution time (min)', 'Number of images involved', 'f1_score']]
We can see that our handmade CNN takes a fifth of the training time while using more than 10 times as many images. However, the results obtained with the VGG16 model are far better.
First, comparing the features approach to the images approach, we notice that the features approach gives pretty bad results. Moreover, we have trouble interpreting the features, while with the images approach we feel more confident about what is happening and which criteria drive the classification. So we decided to side with the images approach.
Secondly, even though the pre-trained model is longer to train, we decided to go with the pre-trained VGG16 CNN. Given the little time we had to fully train a CNN from scratch, we would not have been able to obtain a good CNN without using a pre-trained one. From the beginning, it gave results close to those of the features approach.
Let's now have a look at the performance of the VGG16 pre-trained model.
Image(filename='/content/drive/My Drive/AML/Chal2 - Plankton/Plot/Training_validation_loss.png', width=500, height=300)
Image(filename='/content/drive/My Drive/AML/Chal2 - Plankton/Plot/Best_confusion_matrix.png', width=1000, height=1000)
The confusion matrix (comparing the ratio of true and predicted labels) shows that many images are correctly classified, as the numbers on the diagonal are high. But many images are also misclassified as 'detritus'. This is due to the predominance of detritus in our dataset: our CNN turns out to be very sensitive to detritus.
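The normalisation behind that reading ("ratio of true and predicted labels") corresponds to sklearn's row-normalised confusion matrix, sketched here on toy labels (not our real predictions):

```python
from sklearn.metrics import confusion_matrix

# toy labels: 'detritus' dominates, and one 'copepoda' is mistaken for detritus
y_true = ["detritus"] * 6 + ["copepoda"] * 2
y_pred = ["detritus"] * 6 + ["detritus", "copepoda"]

# normalize='true' turns each row into ratios of the true class
cm = confusion_matrix(y_true, y_pred,
                      labels=["detritus", "copepoda"],
                      normalize="true")
print(cm)  # diagonal = per-class recall; off-diagonal = leakage into detritus
```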
results[results['Model']=='VGG16']
We finally obtained an f1 score of 0.565 on the test set, which is quite a good result.
We could still improve these results by augmenting the data further, since the predominance of detritus makes our CNN sensitive to it.
We could also tune the hyperparameters of the CNN, but it would take some time to try them all and find the best combination.
Finally, we could have tried some more tricks to fit more images into our numpy array while training the VGG16 model; indeed, we were limited to 15,000 images because of RAM issues.
We stuck with this pre-trained model because it gave us good results fairly quickly, but with more time and computational power, a CNN built from scratch and adapted to this problem could give better results.